class: center, middle, inverse, title-slide

.title[
# Class 2b: Review of concepts in Probability and Statistics
]
.author[
### Business Forecasting
]

---

<style type="text/css">
.remark-slide-content {
  font-size: 20px;
}
</style>

---
layout: false
class: inverse, middle

# Summarizing Data
## Summary Statistics

---

# Measures of Central Tendency

## Mean

- **Mean** represents the arithmetic average of the data.
- The population mean `\(\mu\)` is the sum of all observations divided by the total population size:
`$$\mu =E(X)=\frac{\sum_{i=1}^{N} x_i}{N}$$`
  - where `\(N\)` is the total population size, and `\(x_i\)` are individual data points.
- The sample mean, denoted as `\(\bar{x}\)`, is the sample equivalent:
`$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1+x_2+\dots+x_{n-1}+x_n}{n}$$`
where `\(n\)` is the sample size.

---
## Mean

Intuitively, the mean is the balancing point of the distribution.

---
## Mean of a binary variable

What is the mean of a **binary variable**?

- A binary variable is a variable that takes the value 0 or 1
- For example: do you have diabetes (yes=1, no=0)

--

What is the intuitive interpretation of the mean of this variable?

- `\(\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\)`
- `\(\bar{x} = \frac{1+0+0+\dots+0+1}{n}=\frac{n_{diabetes}}{n}=\hat{\mu}_{diabetes}\)`

--

It's the proportion of people with diabetes in the sample: mean(diabetes) = 0.11

---
## Weighted Mean

- In some scenarios, data points have different weights.
- For a dataset with weights `\(w_i\)` and values `\(x_i\)`, the weighted mean is:
`$$\text{Weighted Mean} = \frac{\sum_{i=1}^{n} w_i \cdot x_i}{\sum_{i=1}^{n} w_i}$$`
The **weighted mean** is:
`\begin{align*} \bar{x} & =\frac{0.2\times 6+0.2\times 8+0.15 \times 9+ 0.15 \times 4+0.3 \times 8}{0.2+0.2+0.15+0.15+0.3} = \frac{7.15}{1} = 7.15 \end{align*}`

---
## Aggregated Data

- We want to know the average income in Mexico City.
- But we only know the averages by neighborhood, not the individual data.
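
The weighted-mean formula and the worked numbers above can be checked with a few lines of Python. This is only a minimal sketch: `weighted_mean` is a helper name chosen here, and the neighborhood populations and mean incomes are hypothetical values invented for illustration, not the data behind these slides.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Worked example from the slide: weights 0.2, 0.2, 0.15, 0.15, 0.3
values = [6, 8, 9, 4, 8]
weights = [0.2, 0.2, 0.15, 0.15, 0.3]
print(round(weighted_mean(values, weights), 2))  # 7.15

# Hypothetical neighborhoods: (population, mean income)
neighborhoods = [(1000, 30000.0), (3000, 20000.0), (500, 45000.0)]
populations = [n for n, _ in neighborhoods]
means = [m for _, m in neighborhoods]

unweighted = sum(means) / len(means)
population_weighted = weighted_mean(means, populations)
print(round(unweighted, 2), round(population_weighted, 2))
```

With these made-up numbers the large, lower-income neighborhood pulls the population-weighted mean below the unweighted one.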
---
## Unweighted Mean vs. Weighted Mean

- Unweighted mean is: 25916.67 USD

--

- Weighted mean is: 21760.42 USD

--

- Which one reflects the average population income in CDMX?

--

`$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}= \frac{\sum_{z}\sum_{i=1}^{N_z} x_i}{\sum_z N_z}= \frac{\sum_{z} \frac{N_z}{N_z} \sum_{i=1}^{N_z}x_i}{\sum_z N_z} = \frac{\sum_{z} N_z\sum_{i=1}^{N_z} \frac{x_i}{N_z}}{\sum_z N_z}=\frac{\sum_{z}{N_z} \bar{x}_z}{\sum_z N_z}$$`

- Here `\(N_z\)` is the population of neighborhood `\(z\)` and `\(\bar{x}_z\)` is its mean income, so the population-weighted mean of the neighborhood averages recovers `\(\mu\)`.

---
## Mean

- Is the mean always the right measure?

#### "Bill Gates walks into a bar"

- Suppose a group of people, including Bill Gates, walks into a bar.
- Let's say the net worth of everyone in the group is as follows:

.pull-left[
]
.pull-right[
The **mean** is:
`\begin{align*} \bar{x} & =\frac{10 + 20 + 30 + 40 + 50 + 60000}{6} \\ & = 10025 \\ \end{align*}`

The mean is seriously skewed by the outlier.
]

---
## Mean vs Median

<center>
<img src=mean_median.jpg width="800">
</center>

---
## Median

- **Median** represents the middle value when the data is sorted
- Half of the observations are below it, half are above it.
- For a dataset with odd size `\(n\)`, the median is the `\(\frac{n+1}{2}\)`-th value
- For even size `\(n\)`, it's the average of the `\(\frac{n}{2}\)`-th and `\(\frac{n}{2}+1\)`-th values.

.pull-left[

| Day | Number of Customers |
|-----|---------------------|
| 1   | 20                  |
| 2   | 18                  |
| 3   | 25                  |
| 4   | 22                  |
| 5   | 30                  |
| 6   | 21                  |
| 7   | 27                  |

]
.pull-right[
The dataset has `\(n=7\)` (odd) observations, so to find the median:

- Arrange the data in ascending order:
  - 18, 20, 21, 22, 25, 27, 30.
- The median is the `\(\frac{n+1}{2}\)`-th value, which is the 4th value.
- Thus, the median is 22.
]

---
### Let's look at the median weight in our population

- Mean: 72.66451
- Median: 70.7536

--

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/unnamed-chunk-6-1.png" width="100%" />

- Mean is dotted
- Median is dashed

---
### Median and outliers

I added a couple of observations to the right tail of the distribution

- Old Mean: 72.66, **New Mean: 77.05**
- Old Median: 70.75, **New Median: 70.95**

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/abc-1.png" width="100%" />

---
## Side note on the Mode

**Mode** is the most frequent value in the data

- Let's look at the distribution of the age of people with diabetes
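
The mean, median and mode discussed above can be compared with Python's standard `statistics` module. This is a rough sketch: the customer counts come from the median table, while the extra day with 500 customers and the list of ages are hypothetical values added only to illustrate the point.

```python
import statistics

# Daily customer counts from the median example
customers = [20, 18, 25, 22, 30, 21, 27]
print(statistics.median(customers))  # 22

# Hypothetical outlier: one unusually busy day with 500 customers
with_outlier = customers + [500]

# The mean moves a lot, the median barely changes
print(round(statistics.mean(customers), 2), statistics.mean(with_outlier))
print(statistics.median(with_outlier))  # 23.5

# Hypothetical ages; the mode is the most frequent value
ages = [45, 52, 60, 60, 61, 60, 58]
print(statistics.mode(ages))  # 60
```
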
---
## Mode

<center>
<img src=mode.jpg width="400">
</center>

---
## Percentiles

- How much milk inventory do you need to keep in your Starbucks?

--

- What is the tradeoff between keeping too much and too little inventory?

--

- How much milk should we have, so we don't stock out on at least 95% of days?

--

- To figure it out, let's look at the distribution of the daily use of milk

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/Sales_dist_figure-1.png" width="100%" />

---
## Percentiles

- We want to choose an amount `\(M\)`, such that `\(P(s_i<M)=0.95\)`
- That is, on 95% of days sales are smaller than `\(M\)`

--

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/Sales_dist_figure_with_shaded_region-1.png" width="100%" />

--

- What is this number?
- It's the 95th percentile of the distribution (274 liters)

---
## Percentiles

- *Percentiles* divide the ordered data into 100 equal parts.
- The `\(p\)`th percentile is a value such that `\(p\%\)` of the data are below it
  - `\(v_p\)` is such that `\(P(x_i<v_p)=p\%\)`
  - `\(v_{95}\)` is such that `\(P(x_i<v_{95})=95\%\)`

---
## How to find it in a sample

1. Arrange the data in ascending order

--

2. Find which observation corresponds to the relevant percentile
  - Formula: `\(i = \left(\frac{p}{100}\right)(n+1)\)`
  - Example: To find the 95th percentile in a sample of 1000 observations we look at observation `\(i = \left(\frac{95}{100}\right)(1000+1)=950.95\)`

--

3. If `\(i\)` is an integer, the value of the `\(i\)`th observation is your percentile
4. If it's not, take the average of the observations at `\(i\)` rounded down and `\(i\)` rounded up

In our example it would be the average of the 950th and 951st observations

---
## Or use the CDF

- `\(ECDF(v)=P(x_i<v)\)`
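
The sample procedure above (index `\(i = \frac{p}{100}(n+1)\)`, averaging the two neighbors when `\(i\)` is not an integer) and the empirical CDF can be sketched in Python. The function names `percentile` and `ecdf` and the sample of values 1 to 1000 are assumptions made for illustration. Library routines such as `numpy.percentile` use a different interpolation rule by default, so their results can differ slightly from this rule.

```python
def percentile(data, p):
    """p-th percentile using the i = p/100 * (n + 1) rule from the slides."""
    xs = sorted(data)
    n = len(xs)
    # Integer part and remainder of p * (n + 1) / 100
    q, r = divmod(p * (n + 1), 100)
    lo = int(q)
    hi = lo if r == 0 else lo + 1
    # Keep both positions inside 1..n for very small or very large p
    lo = min(max(lo, 1), n)
    hi = min(max(hi, 1), n)
    # Positions are 1-based; when lo == hi this is just that observation
    return (xs[lo - 1] + xs[hi - 1]) / 2


def ecdf(data, v):
    """Share of observations strictly below v, i.e. P(x_i < v)."""
    return sum(x < v for x in data) / len(data)


# Hypothetical sample of 1000 observations: the values 1, 2, ..., 1000
sample = list(range(1, 1001))
v95 = percentile(sample, 95)
print(v95)  # 950.5, the average of the 950th and 951st observations
print(ecdf(sample, v95))  # 0.95
```
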
---
## Common values

- **Median** - 50th percentile
  - Half of the values are below the median
- **Quartiles** - 25th, 50th and 75th percentile.
  - How poor is the poorest quartile of society?
  - Their income is below the 25th percentile

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/Image_with_shaded_area_Qtiles-1.png" width="100%" />

---

- **Deciles** - 10th, 20th, ... 90th
  - How bad does pollution get in CDMX during the top 10% most polluted days?
  - During the top 10% most polluted days, the pollution level is above the 9th decile.

<img src="data:image/png;base64,#C_2_slides_b_files/figure-html/Image_with_shaded_area_Deciles-1.png" width="100%" />

---
### Example with data

Here is data on the distribution of views across TikTok videos.

- What is the 1st decile?
- What is the 95th percentile?
--

- Index for the first decile is: `\(i = \left(\frac{10}{100}\right)(200+1)=20.1\)`
- First decile is the average of the 20th and 21st observations
- Index for the 95th percentile is: `\(i = \left(\frac{95}{100}\right)(200+1)=190.95\)`
- 95th percentile is the average of the 190th and 191st observations

---
### Example with data

- What is the IQR (interquartile range, `\(q_3-q_1\)`)?
---
### Example with data

Here is a (smaller) dataset on the distribution of views across TikTok videos.

- Suppose that the views of every video triple and 1000 additional people view each of them as well
`$$y_i=3x_i+1000$$`
- What is the new IQR?
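--

- The order of observations is not affected, so each quartile transforms the same way:
`$$q^{New}_{1}=3q^{Old}_1+1000$$`
- The constant cancels in the difference, so `\(IQR^{New}=q^{New}_3-q^{New}_1=3\,IQR^{Old}\)`

--

- And more generally, for `\(b>0\)` and
`$$y_i=bx_i+a$$`
`$$v^y_p=bv^{x}_p+a$$`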
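
The transformation rule above can be checked numerically. The sketch below uses hypothetical view counts (not the data from the slides) and `statistics.quantiles` with `method="exclusive"`, which places the quartiles at positions based on `\(n+1\)`. The helper name `quartiles` is an assumption made for this example.

```python
import math
import statistics


def quartiles(data):
    """Return [q1, q2, q3] using positions based on n + 1."""
    return statistics.quantiles(data, n=4, method="exclusive")


# Hypothetical view counts for seven videos
views = [120, 150, 200, 340, 500, 800, 1200]

# Every video's views triple and gain 1000 more: y_i = 3 * x_i + 1000
b, a = 3, 1000
new_views = [b * x + a for x in views]

q_old = quartiles(views)
q_new = quartiles(new_views)

# Each quartile follows v_p^y = b * v_p^x + a
for old, new in zip(q_old, q_new):
    print(old, new, math.isclose(new, b * old + a))

# The constant a cancels, so the IQR is only scaled by b
iqr_old = q_old[2] - q_old[0]
iqr_new = q_new[2] - q_new[0]
print(iqr_old, iqr_new)
```

Because `\(b>0\)` keeps the ordering of the observations, the same observations sit at each quartile position before and after the transformation.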